In this module we take small steps into modeling, what is also referred to as data analytics. The goal of analytics is simple: find the answer to some question(s). For example, why are low-quality diamonds more expensive? What influences the number of daily flights? To answer these types of questions we start with the usual visual explorations of the data but then eventually turn to fitting some regression-based models to the data. Let us start by understanding the basics of fitting regression models in R and then explore the flights question.
The usual regression model can be specified as \(y = a + b(x)\) or as some prefer to write it, \(y = m(x) + c\). With a single independent variable (x) and a single dependent variable (y) we can fit the regression model in R as follows:
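The code chunk that produced the output below is a plain `lm()` call. The data are Galton's parent-child heights (a `Galton` data-frame ships with the mosaicData and HistData packages; the object name `lm.galton` is my own choice):

```r
# Galton is assumed to come from the mosaicData package;
# HistData ships a similar copy.
library(mosaicData)

# Regress child's height on the parents' mid-height
lm.galton <- lm(child ~ parent, data = Galton)
summary(lm.galton)
```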
##
## Call:
## lm(formula = child ~ parent, data = Galton)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8050 -1.3661 0.0487 1.6339 5.9264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.94153 2.81088 8.517 <2e-16 ***
## parent 0.64629 0.04114 15.711 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.239 on 926 degrees of freedom
## Multiple R-squared: 0.2105, Adjusted R-squared: 0.2096
## F-statistic: 246.8 on 1 and 926 DF, p-value: < 2.2e-16
Here the data represent pairs of observations, with parent referring to the parents’ mid-height and child referring to the child’s height. In this particular example, the regression equation turns out to be child's predicted height = 23.94153 + 0.64629(parent) and indicates that, on average, for every 1-inch increase in the average height of the parents the child’s height is predicted to increase by about 0.64 inches. We also focus on the p-value of the independent variable; the three asterisks we see against parent (***) tell us it is a statistically significant independent variable. As a rule of thumb we will want the variable to have at least one * against it before we treat it as significant. If we see a . or nothing at all, we basically treat the independent variable as having no impact on the dependent variable.
When we fit these models in MPA 6010, remember that we also focused on two other things – (i) the average prediction error (the average error you should expect in your prediction if you use this regression model) and (ii) the adjusted R-squared (how much of the variation in the dependent variable can be explained with this model). The lower the prediction error and the higher the adjusted R-squared, the better the model (loosely speaking, of course).
In R, the Residual standard error is the average prediction error, which in the present case turns out to be 2.239; our model will, on average, overpredict/underpredict by 2.239 inches. The adjusted R-squared is 0.2096, so 20.96% of the variation in a child’s height can be “explained” with this model. That obviously means we have about 79% of the variation left to explain, and that is quite a lot, so the model is not very good.
Once we fit the model we also want to generate predicted values and that can be done via the following R commands.
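Something along the following lines would do it (assuming the fitted model from above is stored as `lm.galton`; `predicted.child` is the column name the plot below expects):

```r
# Append the model's fitted values to the data-frame as a new column
Galton$predicted.child <- predict(lm.galton)
head(Galton)
```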
If you look at the Galton data you will see the predicted child height for a given average height of the parents. We can plot these predicted child heights
library(ggplot2)
ggplot(Galton, aes(x = parent, y = child)) +
    geom_point() +
    geom_line(aes(x = parent, y = predicted.child), color = "red")

Fair enough, but what if the independent variable were categorical? Let us see how the model works in that case, and the data we will use here is the diamonds data. I am going to flip cut into a regular factor since it is stored as an ordered factor in the data-frame.
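The commands that likely produced the frequency table, the data summary, and the regression output below are roughly these (the names `cut.f` and `lm.cut` are my own; the diamonds data ship with ggplot2):

```r
library(ggplot2)  # provides the diamonds data

# Convert the ordered factor cut into a regular (unordered) factor so that
# lm() uses treatment contrasts rather than polynomial contrasts
diamonds$cut.f <- factor(diamonds$cut, ordered = FALSE)
table(diamonds$cut.f)
summary(diamonds)

# Regress price on the unordered cut factor
lm.cut <- lm(price ~ cut.f, data = diamonds)
summary(lm.cut)
```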
##
## Fair Good Very Good Premium Ideal
## 1610 4906 12082 13791 21551
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z cut.f
## Min. : 0.000 Min. : 0.000 Fair : 1610
## 1st Qu.: 4.720 1st Qu.: 2.910 Good : 4906
## Median : 5.710 Median : 3.530 Very Good:12082
## Mean : 5.735 Mean : 3.539 Premium :13791
## 3rd Qu.: 6.540 3rd Qu.: 4.040 Ideal :21551
## Max. :58.900 Max. :31.800
##
##
## Call:
## lm(formula = price ~ cut.f, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4258 -2741 -1494 1360 15348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4358.76 98.79 44.122 < 2e-16 ***
## cut.fGood -429.89 113.85 -3.776 0.000160 ***
## cut.fVery Good -377.00 105.16 -3.585 0.000338 ***
## cut.fPremium 225.50 104.40 2.160 0.030772 *
## cut.fIdeal -901.22 102.41 -8.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3964 on 53935 degrees of freedom
## Multiple R-squared: 0.01286, Adjusted R-squared: 0.01279
## F-statistic: 175.7 on 4 and 53935 DF, p-value: < 2.2e-16
How many categories of cut.f are there? Five, and the regression shows you estimates for only four because the fifth and final category is absorbed into the (Intercept). We see the average prediction error to be 3964 and the adjusted R-squared to be 0.01279 (only 1.279% of the variation in price is explained by this model so it is not a very good model). Now we interpret the estimates.
Note the difference in how we interpret the results if it is a continuous independent variable versus when we have a categorical variable. For continuous variables the estimate is just how much the dependent variable changes by as the independent variable increases by 1. For a categorical independent variable we have to use the Intercept to calculate the predicted value of the dependent variable for a particular category of the independent variable.
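For instance, using the coefficients above, the predicted price for each cut is the (Intercept) — the Fair baseline — plus that cut's estimate:

```r
# Predicted mean price by cut, from the regression table above.
# Fair is the baseline category absorbed into the intercept;
# each cut.f estimate is the *difference* from that baseline.
fair    <- 4358.76
good    <- 4358.76 + (-429.89)   # 3928.87
premium <- 4358.76 + 225.50      # 4584.26
ideal   <- 4358.76 + (-901.22)   # 3457.54
c(Fair = fair, Good = good, Premium = premium, Ideal = ideal)
```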
What happens if we have more than one independent variable? Well, in that case we no longer have a bivariate regression model but instead we have a multiple regression model.
Let us build such a model with the diamonds data-set. I’ll use two independent variables – cut.f and carat.
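The model behind the output below simply adds carat alongside cut.f (the object name `lm.cut.carat` is my own; `cut.f` is the unordered factor created earlier):

```r
# Multiple regression: price on cut (categorical) and carat (continuous)
lm.cut.carat <- lm(price ~ cut.f + carat, data = diamonds)
summary(lm.cut.carat)
```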
##
## Call:
## lm(formula = price ~ cut.f + carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17540.7 -791.6 -37.6 522.1 12721.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3875.47 40.41 -95.91 <2e-16 ***
## cut.fGood 1120.33 43.50 25.75 <2e-16 ***
## cut.fVery Good 1510.14 40.24 37.53 <2e-16 ***
## cut.fPremium 1439.08 39.87 36.10 <2e-16 ***
## cut.fIdeal 1800.92 39.34 45.77 <2e-16 ***
## carat 7871.08 13.98 563.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1511 on 53934 degrees of freedom
## Multiple R-squared: 0.8565, Adjusted R-squared: 0.8565
## F-statistic: 6.437e+04 on 5 and 53934 DF, p-value: < 2.2e-16
First, notice that the average prediction error has dropped to 1511 and the adjusted R-squared has gone up to 0.8565 (i.e., this model explains 85.65% of the variation in price and hence is a much better model than what we had before).
The (Intercept) still refers to a Fair cut, but because we now have a continuous independent variable, carat, we have to pick a value for carat in order to generate the predicted price for a Fair cut diamond. A sensible value to pick for a continuous variable is its median; in this case, we saw the median carat to be 0.7000. Now we can write the regression equation and calculate the predicted price for a Fair cut, 0.7 carat diamond as
\[predicted.price = -3875.47 + 7871.08(0.7000) = 1634.286\]
If it were a Good cut diamond of the same carat value then the predicted price would be
\[predicted.price = -3875.47 + 1120.33 + 7871.08(0.7000) = 2754.616\]
and similarly for an Ideal diamond with the same carat value,
\[predicted.price = -3875.47 + 1800.92 + 7871.08(0.7000) = 3435.206\]
Let us generate box-plots of price by cut.
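A minimal version of that plot:

```r
library(ggplot2)

# Box-plots of price for each category of cut
ggplot(diamonds, aes(x = cut, y = price)) + geom_boxplot()
```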
Notice that all cuts have outliers and the median price is the highest for Fair diamonds. We can see the actual prices if we calculate the medians.
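One way to compute those medians, using the dplyr grammar employed elsewhere in this module:

```r
library(dplyr)
library(ggplot2)  # for the diamonds data

# Median price within each category of cut
diamonds %>%
    group_by(cut) %>%
    summarise(Median.Price = median(price))
```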
## # A tibble: 5 x 2
## cut Median.Price
## <ord> <dbl>
## 1 Fair 3282
## 2 Good 3050.
## 3 Very Good 2648
## 4 Premium 3185
## 5 Ideal 1810
Why is that? Because the weight of a diamond is very critical in determining its price, and some of the lower-quality diamonds tend to be the heaviest.
Notice that most diamonds are in the less-than-2.5-carat range. Let us focus on just this group, which makes up some 99.7% of the data, and then repeat the geom_hex plot.
diamonds2 <- diamonds %>% filter(carat <= 2.5)
ggplot(diamonds2, aes(carat, price)) + geom_hex(bins = 50)

This plot shows a non-linear relationship between carat and price, a relationship that is easy to see if we draw a line through what seems to be the middle of the distribution.
Notice the line is not straight. For linear models, the relationship should be linear and one way of achieving that is by transforming the variables. There are several ways to do this transformation but taking the logarithm is the easiest way of doing so. We carry out the transformation and then plot again.
diamonds2 <- diamonds2 %>%
    mutate(log.price = log(price), log.carat = log(carat))
ggplot(diamonds2, aes(log.carat, log.price)) +
    geom_hex(bins = 50) +
    geom_smooth(color = "red")

Much better, and now we can fit a regression model to the transformed data.
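The fit that produced the output below (the object name `mod_log` is my own):

```r
# Regress log(price) on log(carat) for the trimmed data
mod_log <- lm(log.price ~ log.carat, data = diamonds2)
summary(mod_log)
```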
##
## Call:
## lm(formula = log.price ~ log.carat, data = diamonds2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36139 -0.17016 -0.00585 0.16587 1.34114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.452142 0.001364 6194.5 <2e-16 ***
## log.carat 1.681371 0.001936 868.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2611 on 53812 degrees of freedom
## Multiple R-squared: 0.9334, Adjusted R-squared: 0.9334
## F-statistic: 7.542e+05 on 1 and 53812 DF, p-value: < 2.2e-16
If we wanted to, we could also throw in the other variables to see if we can boost the model’s performance.
diamonds2$color.f <- factor(diamonds2$color, ordered = FALSE)
diamonds2$clarity.f <- factor(diamonds2$clarity, ordered = FALSE)
mod_diamond2 <- lm(log.price ~ log.carat + cut.f + color.f + clarity.f,
    data = diamonds2)
summary(mod_diamond2)
##
## Call:
## lm(formula = log.price ~ log.carat + cut.f + color.f + clarity.f,
## data = diamonds2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81367 -0.08621 -0.00065 0.08263 1.92918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.878793 0.005824 1352.76 <2e-16 ***
## log.carat 1.886239 0.001124 1677.81 <2e-16 ***
## cut.fGood 0.079667 0.003879 20.54 <2e-16 ***
## cut.fVery Good 0.116721 0.003609 32.34 <2e-16 ***
## cut.fPremium 0.139116 0.003570 38.97 <2e-16 ***
## cut.fIdeal 0.160826 0.003539 45.45 <2e-16 ***
## color.fE -0.054404 0.002103 -25.87 <2e-16 ***
## color.fF -0.095218 0.002126 -44.78 <2e-16 ***
## color.fG -0.160878 0.002082 -77.29 <2e-16 ***
## color.fH -0.250963 0.002210 -113.56 <2e-16 ***
## color.fI -0.372398 0.002477 -150.36 <2e-16 ***
## color.fJ -0.511213 0.003061 -166.99 <2e-16 ***
## clarity.fSI2 0.408518 0.005248 77.84 <2e-16 ***
## clarity.fSI1 0.572443 0.005216 109.75 <2e-16 ***
## clarity.fVS2 0.721966 0.005243 137.71 <2e-16 ***
## clarity.fVS1 0.792239 0.005319 148.94 <2e-16 ***
## clarity.fVVS2 0.927682 0.005475 169.45 <2e-16 ***
## clarity.fVVS1 0.999500 0.005626 177.65 <2e-16 ***
## clarity.fIF 1.094550 0.006071 180.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1328 on 53795 degrees of freedom
## Multiple R-squared: 0.9828, Adjusted R-squared: 0.9828
## F-statistic: 1.706e+05 on 18 and 53795 DF, p-value: < 2.2e-16
Notice the adjusted R-squared is at 0.9828 so this model is explaining some 98.28% of the variation in price, leaving very little variation in price to be explained.
We can engage this question with the NYC flights data found in the nycflights13 package.
We start by calculating the number of daily flights on a given day and then plotting this frequency.
library(nycflights13)
library(lubridate)
daily <- flights %>%
    mutate(date = make_date(year, month, day)) %>%
    group_by(date) %>%
    summarise(n = n())
daily
## # A tibble: 365 x 2
## date n
## <date> <int>
## 1 2013-01-01 842
## 2 2013-01-02 943
## 3 2013-01-03 914
## 4 2013-01-04 915
## 5 2013-01-05 720
## 6 2013-01-06 832
## 7 2013-01-07 933
## 8 2013-01-08 899
## 9 2013-01-09 902
## 10 2013-01-10 932
## # ... with 355 more rows
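The plot referenced here can be drawn as a simple time series of the daily counts:

```r
library(ggplot2)

# Number of flights per day across 2013; daily is assumed to be the
# tibble built above
ggplot(daily, aes(x = date, y = n)) + geom_line()
```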
Hmm, this is not an easy plot to read for picking up any marked patterns, but what if we broke it out by the day of the week? That might help us see whether the number of flights varies by what day of the week it is. We can do it via box-plots.
daily <- daily %>% mutate(wday = wday(date, label = TRUE))
ggplot(daily, aes(wday, n)) + geom_boxplot()

So Sundays and Saturdays have the fewest flights, not surprising because most flights are for business purposes and so you are more likely to leave on Sunday for a Monday meeting than on a Saturday (which is why Saturdays have the fewest flights on average).
Let us now fit a regression model to these data, trying to predict the number of flights from the day of the week. Note that the (Intercept) is referring to Sundays.
daily$wday.f <- factor(daily$wday, ordered = FALSE)
lm.wday <- lm(n ~ wday.f, data = daily)
summary(lm.wday)
##
## Call:
## lm(formula = n ~ wday.f, data = daily)
##
## Residuals:
## Min 1Q Median 3Q Max
## -331.75 -8.69 12.31 25.19 112.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 891.481 6.767 131.742 < 2e-16 ***
## wday.fMon 83.327 9.570 8.707 < 2e-16 ***
## wday.fTue 59.878 9.525 6.287 9.44e-10 ***
## wday.fWed 71.212 9.570 7.441 7.48e-13 ***
## wday.fThu 74.269 9.570 7.761 8.92e-14 ***
## wday.fFri 75.981 9.570 7.940 2.64e-14 ***
## wday.fSat -146.865 9.570 -15.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.8 on 358 degrees of freedom
## Multiple R-squared: 0.7178, Adjusted R-squared: 0.7131
## F-statistic: 151.8 on 6 and 358 DF, p-value: < 2.2e-16
If a regression model is capturing most of the variation in the dependent variable, the residuals should be essentially pure noise, i.e., reflecting no systematic pattern whatsoever. What is the residual? You fit a model, generate predicted values of the dependent variable, and then if you deduct the predicted value \((\hat{y})\) from the actual value \((y)\) you will have the residuals. Let us do this next.
library(modelr)
options(na.action = na.warn)
grid <- daily %>%
    data_grid(wday.f) %>%
    add_predictions(lm.wday, "n")
grid
## # A tibble: 7 x 2
## wday.f n
## <fct> <dbl>
## 1 Sun 891.
## 2 Mon 975.
## 3 Tue 951.
## 4 Wed 963.
## 5 Thu 966.
## 6 Fri 967.
## 7 Sat 745.
The grid is a simple table with predicted number of flights for each day of the week. The plot we have drawn shows you the actual number of flights per day of the week through the year as box-plots, and the predicted number of flights as the red dot, one for each day, pulled from the values in grid.
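A sketch of that plot, overlaying the predictions held in grid on the daily box-plots:

```r
library(ggplot2)

# Actual daily flight counts as box-plots per weekday, with the model's
# predicted count for each weekday overlaid as a red point (from grid)
ggplot(daily, aes(wday.f, n)) +
    geom_boxplot() +
    geom_point(data = grid, colour = "red", size = 4)
```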
Now we calculate and plot the residuals.
daily <- daily %>% add_residuals(lm.wday)
daily %>% ggplot(aes(date, resid)) +
    geom_ref_line(h = 0) +
    geom_line()

Notice the geom_ref_line(h = 0) drawn where the residual is 0 … which would be the case if predicted and actual number of flights were identical (since then \(y - \hat{y} = 0\)). Negative residuals occur only when we overpredict (since then \(\hat{y}\) exceeds \(y\)); positive residuals occur when we underpredict (since then \(y\) exceeds \(\hat{y}\)). These residuals are important because they represent the remaining patterns in the data, patterns that are not due to the day of the week since that effect was already modeled in the regression.
Why are we seeing huge residuals on certain dates and not others? Let us see what results if we break this plot down by the day of the week.
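One way to break the residual plot out by day, colouring each weekday's residual series separately (this uses modelr's geom_ref_line as above):

```r
library(ggplot2)
library(modelr)

# Residuals over the year, one line per day of the week
ggplot(daily, aes(date, resid, colour = wday)) +
    geom_ref_line(h = 0) +
    geom_line()
```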
What you are looking for is whether the residuals are generally close to the zero reference line across the year or if they tend to cross above or below zero for long periods. The one day that shows this deviation is Saturday.
Note, before we move on, that the largest residuals are all negative and seem to occur in June/July, September/October, and November/December. Which dates have the largest residuals? We can find out quite easily.
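For example, filtering for residuals below -100 recovers the table below (the cutoff is my own choice; it happens to isolate these 11 days):

```r
library(dplyr)

# The days the model overpredicts by more than 100 flights,
# worst offenders first
daily %>%
    filter(resid < -100) %>%
    arrange(resid)
```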
## # A tibble: 11 x 5
## date n wday wday.f resid
## <date> <int> <ord> <fct> <dbl>
## 1 2013-11-28 634 Thu Thu -332.
## 2 2013-11-29 661 Fri Fri -306.
## 3 2013-12-25 719 Wed Wed -244.
## 4 2013-07-04 737 Thu Thu -229.
## 5 2013-12-24 761 Tue Tue -190.
## 6 2013-12-31 776 Tue Tue -175.
## 7 2013-09-01 718 Sun Sun -173.
## 8 2013-05-26 729 Sun Sun -162.
## 9 2013-07-05 822 Fri Fri -145.
## 10 2013-01-01 842 Tue Tue -109.
## 11 2013-01-20 786 Sun Sun -105.
So November 28 and 29, followed by December 25, the Fourth of July, and so on. Why are the residuals so large and negative? Because our model does not include specific flags for holidays or other special occasions, and hence it predicts far more flights on these days than actually flew.
We can dig into the mystery of poor predictions for Saturday flights by focusing purely on this day of the week. As we do this we can plot the actual number of flights across the year to see the actual variation in the number of daily flights.
daily %>% filter(wday == "Sat") %>%
    ggplot(aes(date, n)) +
    geom_point() +
    geom_line() +
    scale_x_date(NULL, date_breaks = "1 month", date_labels = "%b")

Note the scale_x_date() command; it allows us to create breaks by month and to have abbreviated month names show up on the x-axis. Now, if the number of flights varied randomly we would see very little pattern in the data, but instead it is obvious that Saturday flights increase from February through April, then dip, then start back up again through the middle of August, and then rise again towards the end of November through December. Some of these increases are during the holiday season, and then in the summer months when vacation travel rises. This would become more obvious if we created indicators for school terms.
term <- function(date) {
    cut(date,
        breaks = ymd(20130101, 20130605, 20130825, 20140101),
        labels = c("spring", "summer", "fall"))
}
daily <- daily %>% mutate(term = term(date))
daily %>% filter(wday == "Sat") %>%
    ggplot(aes(date, n, colour = term)) +
    geom_point(alpha = 1/3) +
    geom_line() +
    scale_x_date(NULL, date_breaks = "1 month", date_labels = "%b")

Aha! Notice the summer increase. But do flights increase for all days of the week in the summer? To answer that question we would have to plot daily flights by day of the week, colour by term, and use a box-plot.
If term had no impact, the three box-plots for a given day of the week would be very similar, but that is not the case. The box-plots are least symmetric for spring, and fall has several outliers. It might be worthwhile to include term as an independent variable in the regression model and see if that helps generate a better model.
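The box-plots referred to here can be produced with:

```r
library(ggplot2)

# Daily flight counts by weekday, one box-plot per term within each day
daily %>% ggplot(aes(wday, n, colour = term)) + geom_boxplot()
```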
mod1 <- lm(n ~ factor(wday, ordered = FALSE), data = daily)
mod2 <- lm(n ~ factor(wday, ordered = FALSE) + factor(term, ordered = FALSE),
    data = daily)
summary(mod1)
summary(mod2)
##
## Call:
## lm(formula = n ~ factor(wday, ordered = FALSE), data = daily)
##
## Residuals:
## Min 1Q Median 3Q Max
## -331.75 -8.69 12.31 25.19 112.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 891.481 6.767 131.742 < 2e-16 ***
## factor(wday, ordered = FALSE)Mon 83.327 9.570 8.707 < 2e-16 ***
## factor(wday, ordered = FALSE)Tue 59.878 9.525 6.287 9.44e-10 ***
## factor(wday, ordered = FALSE)Wed 71.212 9.570 7.441 7.48e-13 ***
## factor(wday, ordered = FALSE)Thu 74.269 9.570 7.761 8.92e-14 ***
## factor(wday, ordered = FALSE)Fri 75.981 9.570 7.940 2.64e-14 ***
## factor(wday, ordered = FALSE)Sat -146.865 9.570 -15.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.8 on 358 degrees of freedom
## Multiple R-squared: 0.7178, Adjusted R-squared: 0.7131
## F-statistic: 151.8 on 6 and 358 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = n ~ factor(wday, ordered = FALSE) + factor(term,
## ordered = FALSE), data = daily)
##
## Residuals:
## Min 1Q Median 3Q Max
## -325.61 -6.57 12.35 23.68 118.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 882.087 7.049 125.138 < 2e-16
## factor(wday, ordered = FALSE)Mon 83.327 9.122 9.135 < 2e-16
## factor(wday, ordered = FALSE)Tue 60.055 9.079 6.615 1.37e-10
## factor(wday, ordered = FALSE)Wed 70.562 9.123 7.735 1.08e-13
## factor(wday, ordered = FALSE)Thu 73.620 9.123 8.070 1.09e-14
## factor(wday, ordered = FALSE)Fri 75.332 9.123 8.257 2.95e-15
## factor(wday, ordered = FALSE)Sat -147.515 9.123 -16.169 < 2e-16
## factor(term, ordered = FALSE)summer 37.663 6.378 5.905 8.25e-09
## factor(term, ordered = FALSE)fall 3.903 5.544 0.704 0.482
##
## (Intercept) ***
## factor(wday, ordered = FALSE)Mon ***
## factor(wday, ordered = FALSE)Tue ***
## factor(wday, ordered = FALSE)Wed ***
## factor(wday, ordered = FALSE)Thu ***
## factor(wday, ordered = FALSE)Fri ***
## factor(wday, ordered = FALSE)Sat ***
## factor(term, ordered = FALSE)summer ***
## factor(term, ordered = FALSE)fall
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.51 on 356 degrees of freedom
## Multiple R-squared: 0.745, Adjusted R-squared: 0.7393
## F-statistic: 130 on 8 and 356 DF, p-value: < 2.2e-16
We do see a slight improvement in the adjusted R-squared, from 0.7131 to 0.7393. We could stop here or keep digging into the data, but doing so would require us to rely on more complicated models that are beyond the scope of this class. Nevertheless, we have done well in terms of analyzing these data, identifying two things – the day of the week and the school term both influence the number of daily flights.
So if you were told what day of the week it is and what term it is, you would be able to generate a reasonable prediction about how many flights should be expected on that given day.
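For example, from mod2's coefficient table above, the expected number of flights on a summer Saturday is the intercept (a spring Sunday baseline) plus the Saturday and summer effects:

```r
# Expected flights on a summer Saturday, from mod2's coefficients:
# intercept + Saturday effect + summer effect
882.087 + (-147.515) + 37.663
## [1] 772.235
```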
What if, instead of term, you used the month as an independent variable in the model — would you end up with a better model, in terms of an average prediction error smaller than 48.8 and a higher adjusted R-squared than 0.7131? Show your work.